Fast Approximate Natural Gradient Descent in a Kronecker Factored Eigenbasis
For models with many parameters, the covariance matrix they are based on becomes gigantic, making them inapplicable in their original form. This has motivated research into both simple diagonal approximations and more sophisticated factored approximations such as KFAC (Heskes, 2000; Martens & Grosse, 2015; Grosse & Martens, 2016). In the present work we draw inspiration from both to propose a novel approximation that is provably better than KFAC and amenable to cheap partial updates. It consists of tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective. Experiments show improvements over KFAC in optimization speed for several deep network architectures.
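The core idea can be illustrated in a few lines. The sketch below, written against NumPy with made-up dimensions and synthetic data, shows the method for a single fully-connected layer: eigendecompose the two KFAC Kronecker factors, project per-sample weight gradients into the resulting Kronecker-factored eigenbasis, and track a diagonal second moment there (rather than using the Kronecker product of eigenvalues, as KFAC implicitly does). All variable names and the damping constant are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for one fully-connected layer: W has shape (d_out, d_in).
d_in, d_out, n_samples = 4, 3, 256

# KFAC-style Kronecker factors (assumed already estimated from data):
# A ~ E[a a^T] from layer inputs a, B ~ E[g g^T] from backpropagated gradients g.
a = rng.standard_normal((n_samples, d_in))
g = rng.standard_normal((n_samples, d_out))
A = a.T @ a / n_samples
B = g.T @ g / n_samples

# Eigendecompose each factor; (U_A kron U_B) spans the Kronecker-factored eigenbasis.
s_A, U_A = np.linalg.eigh(A)
s_B, U_B = np.linalg.eigh(B)

# Per-sample weight gradients G_i = g_i a_i^T, projected into the eigenbasis.
G = np.einsum('no,ni->noi', g, a)                # shape (n, d_out, d_in)
G_tilde = np.einsum('po,noi,iq->npq', U_B.T, G, U_A)

# Track a diagonal second moment directly in the eigenbasis,
# instead of the Kronecker product of eigenvalues s_A kron s_B.
s_star = (G_tilde ** 2).mean(axis=0)             # shape (d_out, d_in), nonnegative

# Precondition a gradient: project, divide by the tracked variances, project back.
eps = 1e-3                                       # damping term (assumed hyperparameter)
G_mean = G.mean(axis=0)
G_precond = U_B @ ((U_B.T @ G_mean @ U_A) / (s_star + eps)) @ U_A.T
```

In practice the eigendecompositions are amortized over many steps, while the cheap diagonal `s_star` can be refreshed at every step, which is what makes the partial updates mentioned above inexpensive.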
Reviews: Fast Approximate Natural Gradient Descent in a Kronecker Factored Eigenbasis
Summary: The paper describes a generic second-order stochastic optimisation scheme that exploits curvature information to improve the trade-off between convergence speed and computational effort. It proposes an extension of the approximate natural gradient method KFAC, in which the Fisher information matrix is restricted to a Kronecker structure. The authors propose to relax the Kronecker constraint and to use a general diagonal scaling matrix rather than a diagonal Kronecker scaling matrix. This diagonal scaling matrix is estimated from gradients, along with the Kronecker eigenbasis.
Quality: The idea in the paper is convincing and makes sense.
Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, Pascal Vincent